11/20/23
Descriptive statistics/histograms/correlation matrix to visualize the spread of the data.
Preprocess and scale data for ease of comparability.
Libraries used: tidyverse, cluster, and factoextra.
Euclidean distances are calculated as the clustering distance measure.
The distance data is visualized using the fviz_dist() function in R from the factoextra package.
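A minimal sketch of this distance step, using the built-in mtcars data as a stand-in for the scaled Spotify features (the real analysis applies the same calls to the scaled audio features):

```r
library(factoextra)  # provides get_dist() and fviz_dist()

scaled_data <- scale(mtcars)                            # standardize each variable
dist_mat <- get_dist(scaled_data, method = "euclidean") # pairwise Euclidean distances
fviz_dist(dist_mat)                                     # heatmap of the distance matrix
```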
\[WCSS = \sum_{i=1}^{K}\sum_{j=1}^{n_{i}}\left \| x_{ij} - c_{i} \right \|^2\]
The WCSS measures how tightly the data within each cluster are grouped. The variables used from the dataset are then scaled. After scaling, the next steps are:
Euclidean distance is calculated between each observation and the cluster center: \[ D_{euclidean}(x, c_{i}) = \sqrt{\sum_{j=1}^{n}(x_{j} - c_{ij})^{2}} \]
Where: \(x\) is an observation with \(n\) attributes, \(c_{i}\) is the center of cluster \(i\), and \(c_{ij}\) is its \(j\)-th coordinate.
Data source: Spotify data from Kaggle
For our segmentation we use a Spotify dataset that contains the audio features of each track. There are 32,828 observations across 22 variables: 12 numeric and 10 character.
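A hedged sketch of the kind of cleaning the data gets before scaling, using a toy tibble in place of the real Kaggle file (the column names track_id, popularity, energy, and loudness are assumptions based on the standard Spotify audio-features file):

```r
library(tidyverse)

# Toy stand-in for the Kaggle file; column names are assumptions.
spotify <- tibble(
  track_id   = c("a", "a", "b", "c"),
  popularity = c(60, 60, NA, 40),
  energy     = c(0.80, 0.80, 0.50, 0.30),
  loudness   = c(-5, -5, -9, -12)
)

spotify_clean <- spotify %>%
  drop_na() %>%                              # drop incomplete records
  distinct(track_id, .keep_all = TRUE) %>%   # remove duplicate tracks
  select(where(is.numeric))                  # keep only the numeric features

spotify_scaled <- scale(spotify_clean)       # z-score each variable
```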
In the correlation plot, the darker the blue the greater the correlation between the variables. The chart shows a positive correlation between energy and loudness and a negative correlation between acousticness and energy.
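The plot visualizes the pairwise correlation matrix, which is computed with cor(). A base-R sketch with mtcars standing in for the audio features (hp/wt play the role of a positively correlated pair like energy and loudness, mpg/wt a negatively correlated one like acousticness and energy):

```r
corr_mat <- cor(mtcars)   # pairwise Pearson correlations

corr_mat["hp", "wt"]    # positive, analogous to energy vs. loudness
corr_mat["mpg", "wt"]   # negative, analogous to acousticness vs. energy
```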
We want to understand which genre is the most popular in our dataset. To start, we split the tracks into popular (1) and unpopular (0) based on a popularity score over 57 (on a 0-99 scale). We filter out all of the unpopular songs and build our graph by playlist genre. Pop, latin, and rap have the highest counts of popular songs in our review.
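A sketch of this popularity split in tidyverse style, with a toy tibble in place of the real data (the column names track_popularity and playlist_genre are assumptions):

```r
library(tidyverse)

songs <- tibble(
  track_popularity = c(80, 30, 60, 90, 10),
  playlist_genre   = c("pop", "rock", "latin", "pop", "edm")
)

popular_songs <- songs %>%
  mutate(popular = if_else(track_popularity > 57, 1, 0)) %>%  # 1 = popular
  filter(popular == 1)                                        # keep popular only

count(popular_songs, playlist_genre, sort = TRUE)   # popular songs per genre
```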
In cluster 1, there are 2,292 records.
In cluster 2, there are 1,263 records.
In cluster 3, there are 3,010 records.
Centroid Center Positions:
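Cluster sizes and centroid positions both come straight from the fitted model object; a reproducible sketch on stand-in data (mtcars in place of the scaled audio features):

```r
set.seed(123)                   # k-means initialization is random; seed for reproducibility
scaled_data <- scale(mtcars)    # stand-in for the scaled audio features

km <- kmeans(scaled_data, centers = 3, nstart = 25)

km$size      # number of records in each cluster
km$centers   # centroid positions, in scaled (z-score) units
```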
The ratio of between-cluster sum of squares (BSS) to total sum of squares (TSS).
This measurement, ranging from 0 to 1, shows how well separated the clusters are.
The closer to 1, the more distinct the clusters are within the dataset.
In our model, the BSS/TSS ratio is 0.2042409, which is fairly low for this type of model. However, we determined that a low number of clusters was sufficient here, which also results in a low BSS/TSS ratio.
BSS/TSS Ratio: 0.2039852
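The ratio is computed directly from the model's sum-of-squares fields; a sketch on the same kind of stand-in data:

```r
set.seed(123)  # seed so the k-means result is reproducible
km <- kmeans(scale(mtcars), centers = 3, nstart = 25)

bss_tss <- km$betweenss / km$totss   # between-cluster SS over total SS
bss_tss                              # closer to 1 = more distinct clusters
```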
We would like to see the number of popular songs in each cluster. As in our previous analysis, we take our Spotify review set and count the popular tracks (popularity score > 57) in each cluster.
In cluster 1, there are 492 popular tracks.
In cluster 2, there are 405 popular tracks.
In cluster 3, there are 1,241 popular tracks.
The graph compares the groups as we use this new only_pop data to see what type of popular tracks are in each cluster.
In our k-means clustering model, cluster 1 has the highest duration_min, energy, instrumentalness, liveness, loudness, and tempo. These are fast and upbeat songs, including Blinding Lights, Crazy Train, and Sweet Child O' Mine. The majority of genres are pop, rock, and EDM, which make up our most popular groups. These songs are low on acousticness, danceability, speechiness, and valence. This cluster also has the lowest number of popular tracks.
In our k-means clustering model, cluster 2 has the highest danceability, speechiness, and valence, and the highest number of popular songs among our clusters. These are cheerful and vocal songs, with Memories, Falling, and everything I wanted as the top songs. The majority of genres are pop, latin, and rock. These songs are low on duration_min and instrumentalness.
Cluster 3
In our k-means clustering model, cluster 3 has the highest acousticness. These are soft rock or acoustic songs, including Roxanne, The Box, and Circles. These songs are low on energy, liveness, loudness, and tempo. This cluster is very close to having the lowest number of popular tracks, and its top genres are latin, rap, and pop.
K-Means provides a simple yet insightful way to glean data insights.
Unsupervised machine learning finds hidden structure in an unlabeled dataset; its aim is to find similarities within groups of the data.
Three cluster groups were found.
We can offer song recommendations based on this analysis.
We recommend trying other clustering methods to verify whether the results are similar.